<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <meta name="version" content="S5 1.1" />
  <meta name="author" content="Luiza Andrade and Sushmita Samaddar" />
  <title>Constructing data sets</title>
  <style type="text/css">
      code{white-space: pre-wrap;}
      span.smallcaps{font-variant: small-caps;}
      span.underline{text-decoration: underline;}
      div.column{display: inline-block; vertical-align: top; width: 50%;}
  </style>
  <!-- configuration parameters -->
  <meta name="defaultView" content="slideshow" />
  <meta name="controlVis" content="hidden" />
  <!-- style sheet links -->
  <link rel="stylesheet" href="www/slides.css" type="text/css" media="projection" id="slideProj" />
  <link rel="stylesheet" href="www/outline.css" type="text/css" media="screen" id="outlineStyle" />
  <link rel="stylesheet" href="www/print.css" type="text/css" media="print" id="slidePrint" />
  <link rel="stylesheet" href="www/opera.css" type="text/css" media="projection" id="operaFix" />
  <!-- S5 JS -->
  <script src="www/slides.js" type="text/javascript"></script>
</head>
<body>
<div class="layout">
<div id="controls"></div>
<div id="currentSlide"></div>
<div id="header"></div>
<div id="footer">
  <h1></h1>
  <h2>Constructing data sets</h2>
</div>
</div>
<div class="presentation">
<div class="title-slide slide">
  <h1 class="title">Constructing data sets</h1>
  <h3 class="author">Luiza Andrade and Sushmita Samaddar</h3>
</div>
<div id="what-is-data-construction" class="slide section level1">
<h1>What is data construction</h1>
<ul>
<li>Constructing variables means <span style="color:orange">processing the data points as provided in the raw data to make them suitable for analysis</span></li>
<li>Construction transform clean data into analysis data</li>
<li>This is done by creating derived variables (e.g. dummies, indices, interactions), reshaping, combining and aggregating data sets</li>
<li>It is the only stage when changes will be made to data points (other than correcting mistakes in the data)</li>
<li>Ideally, indicator construction should be done right after data cleaning, according to the data map</li>
</ul>

</div>
<div id="what-is-data-construction-1" class="slide section level1">
<h1>What is data construction</h1>
<p><strong>Inputs</strong></p>
<ul>
<li>One or more clean data sets</li>
</ul>
<p><strong>Outputs</strong></p>
<ul>
<li>One or more constructed data sets</li>
<li>One data dictionary/codebook for each constructed data set</li>
<li>Construction documentation</li>
</ul>
<p><strong>Tasks</strong></p>
<ul>
<li>Unit of observation <span class="math inline">→</span> Unit of analysis</li>
<li>Observed measurement <span class="math inline">→</span> Analysis indicator</li>
</ul>
</div>
<div id="what-is-data-construction-2" class="slide section level1">
<h1>What is data construction</h1>
<p><strong>Why is construction a separate task from data cleaning?</strong></p>
<ul>
<li>To clearly differentiate the data originally acquired data from the result of data processing decisions</li>
</ul>
</div>
<div id="what-to-plan-ahead" class="slide section level1">
<h1>What to plan ahead</h1>
<ul>
<li>What are the <span style="color:orange">final indicators</span> needed to answer a question</li>
<li>How they are <span style="color:orange">defined</span> and calculated</li>
<li>What are the steps to get there</li>
</ul>
</div>
<div id="what-to-plan-ahead-1" class="slide section level1">
<h1>What to plan ahead</h1>
<p><img src="img/flowchart-complete.png" /></p>
</div>
<div id="what-we-will-learn-today" class="slide section level1">
<h1>What we will learn today</h1>
<ul>
<li>Two commands to <span style="color:orange">create new variables in the data</span></li>
<li>How to <span style="color:orange">aggregate observations</span></li>
<li>How to <span style="color:orange">combine data tables</span></li>
</ul>
</div>
<div id="setting-the-stage" class="slide section level1">
<h1>Setting the stage</h1>
<p><strong>Exercise</strong></p>
<p><strong>1.</strong> Launch Stata by opening the Stata project in <code>DataWork/Introduction to Stata.stpr</code></p>
<p><strong>2.</strong> Open a new do-file</p>
<p><strong>3.</strong> Load the clean process-level data</p>
</div>
<div id="setting-the-stage-1" class="slide section level1">
<h1>Setting the stage</h1>
<p><strong>Exercise</strong></p>
<p><strong>1.</strong> Launch Stata by opening the Stata project in <code>DataWork/Introduction to Stata.stpr</code></p>
<p><strong>2.</strong> Open a new do-file</p>
<p><strong>3.</strong> Load the clean process-level data</p>
<pre><code>use &quot;DataWork/Data/Clean/process_clean.dta&quot;, clear</code></pre>
</div>
<div id="create-new-numeric-variables" class="slide section level1">
<h1>Create new numeric variables</h1>
<p>Here are some simple mathematical operations that can be applied to numeric variables:</p>
<ul>
<li><strong>Addition:</strong> <code>numvar1 + numvar2</code> or <code>numvar + num</code></li>
<li><strong>Subtraction:</strong> <code>numvar1 - numvar2</code> or <code>numvar - num</code></li>
<li><strong>Multiplication:</strong> <code>numvar1 * numvar2</code> or <code>numvar * num</code></li>
<li><strong>Division:</strong> <code>numvar1 / numvar2</code> or <code>numvar / num</code></li>
</ul>
</div>
<div id="construct-numeric-variables" class="slide section level1">
<h1>Construct numeric variables</h1>
<p><strong>Exercise:</strong> use the <code>generate</code> command to create a new variable in the data set that changes the order of magnitude of contract values from 1 HKR to 100.000 HRK.</p>
</div>
<div id="construct-numeric-variables-1" class="slide section level1">
<h1>Construct numeric variables</h1>
<p><strong>Exercise:</strong> use the <code>generate</code> command to create a new variable in the data set that changes the order of magnitude of contract values from 1 HKR to 100.000 HRK.</p>
<pre><code>gen value_scaled = Process_VrijednostNabaven/100000</code></pre>
</div>
<div id="construct-date-difference-variables" class="slide section level1">
<h1>Construct date difference variables</h1>
<p><span style="font-size:10%"></span></p>
<pre><code>datediff(date1, date2, unit) </code></pre>
<p>where units can be:</p>
<ul>
<li>“day” or “d” for day</li>
<li>“hour” or “h” for hour</li>
<li>“minute”, “min”, or “m” for minute</li>
<li>“second”, “sec”, or “s” for second</li>
<li>“millisecond” or “ms” for millisecond</li>
<li>“month”, “mon”, or “m” for month</li>
<li>“year” or “y” for year</li>
</ul>
</div>
<div id="construct-date-difference-variables-1" class="slide section level1">
<h1>Construct date difference variables</h1>
<p><strong>Exercise:</strong> calculate the difference in days between the initiation of the process and the deadline for bid submission. Call it <code>bid_period</code>.</p>
</div>
<div id="construct-date-difference-variables-2" class="slide section level1">
<h1>Construct date difference variables</h1>
<p><strong>Exercise:</strong> calculate the number of days between the initiation of the process and the deadline for bid submission. Call it <code>bid_period</code>.</p>
<pre class='stata'>. gen bid_period = datediff(process_initiation_date, bid_submission_date, "day")
(23,888 missing values generated)

. summarize bid_period

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
  bid_period │      1,111    17.91539    23.40982          0        285
</pre>
</div>
<div id="extract-information-from-date-variables" class="slide section level1">
<h1>Extract information from date variables</h1>
<p>The following functions can be used to create derived variables from dates in <code>%td</code> format</p>
<ul>
<li><code>year(d)</code>: numeric year corresponding to date <code>d</code></li>
<li><code>quarter(d)</code>: numeric quarter of the year corresponding to date <code>d</code></li>
<li><code>month(d)</code>: numeric month corresponding to date <code>d</code></li>
<li><code>week(d)</code>: numeric week of the year corresponding to date <code>d</code></li>
<li><code>day(d)</code>: numeric day of the month corresponding to <code>d</code></li>
</ul>
<pre class='stata'>. gen year_init = year(process_initiation_date)
(14 missing values generated)
</pre>
</div>
<div id="extract-information-from-date-variables-1" class="slide section level1">
<h1>Extract information from date variables</h1>
<p><strong>Exercise:</strong> create two new variables:</p>
<ul>
<li><code>month_init</code>, representing the month, and</li>
<li><code>quart_init</code>, quarter when the process was initiated</li>
</ul>
</div>
<div id="extract-information-from-date-variables-2" class="slide section level1">
<h1>Extract information from date variables</h1>
<p><strong>Exercise:</strong> create two new variables:</p>
<ul>
<li><code>month_init</code>, representing the month, and</li>
<li><code>quart_init</code>, quarter when the process was initiated</li>
</ul>
<pre class='stata'>. gen month_init   = month(process_initiation_date)
(14 missing values generated)

. gen quarter_init = quarter(process_initiation_date)
(14 missing values generated)
</pre>
</div>
<div id="extract-information-from-date-variables-3" class="slide section level1">
<h1>Extract information from date variables</h1>
<ul>
<li><code>mdy(M,D,Y)</code>: the date corresponding to month M, day D, year Y</li>
<li><code>yq(Y,Q)</code>: the quarterly date corresponding to year Y, quarter Q</li>
</ul>
<p>For example, <code>gen today = mdy(10,19,2021)</code> and <code>gen today = td(19oct2021)</code> would create the same values.</p>
</div>
<div id="extract-information-from-date-variables-4" class="slide section level1">
<h1>Extract information from date variables</h1>
<p><strong>Exercise:</strong> create a new variable called <code>yquarter_init</code> that indicates the year and quarter when a procedure was initiated.</p>
</div>
<div id="extract-information-from-date-variables-5" class="slide section level1">
<h1>Extract information from date variables</h1>
<p><strong>Exercise:</strong> create a new variable called <code>yquarter_init</code> that indicates the year and quarter when a procedure was initiated.</p>
<pre class='stata'>. gen      yquarter_init = yq(year_init, quarter_init) 
(14 missing values generated)

. format   yquarter_init   %tq

. codebook yquarter_init

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
yquarter_init                                                                                                          (unlabeled)
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

                  type:  numeric quarterly date (float)

                 range:  [219,245]                    units:  1
       or equivalently:  [2014q4,2021q2]              units:  quarters
         unique values:  27                       missing .:  14/24,999

                  mean:   232.937 = 2018q1 (+ 2.8 months)
              std. dev:   7.29312

           percentiles:        10%       25%       50%       75%       90%
                               222       227       234       239       243
                            2015q3    2016q4    2018q3    2019q4    2020q4
</pre>
</div>
<div id="create-new-variables-egen" class="slide section level1">
<h1>Create new variables: <code>egen</code></h1>
<ul>
<li>Apart from the <code>generate</code> command, there is another command that is very useful when creating new variables</li>
<li>It is called <code>egen</code>, or “extended generate”</li>
<li>There is a number of functions that will only work when used with <code>egen</code></li>
<li>There are too many of them, and we will not discuss all, but you can see a complete list by typing <code>help egen</code></li>
</ul>
</div>
<div id="create-new-variables-egen-1" class="slide section level1">
<h1>Create new variables: <code>egen</code></h1>
<ul>
<li>This command is particularly useful when we want to aggregate information across rows without changing the level of observation in the data set</li>
<li>Say, for example, that we want to calculate the weight of one process on the total contracted value for an entity</li>
<li>We would need to divide the process value by the total value contracted by the entity</li>
<li><code>egen</code> allows us to calculate the total contracting value without changing the unit of observation of the data set as follows</li>
</ul>
<pre class='stata'>. egen total_value = sum(process_value), by(entity)
</pre>
</div>
<div id="create-new-variables-egen-2" class="slide section level1">
<h1>Create new variables: <code>egen</code></h1>
<p><strong>Exercise:</strong> create a new variable called <code>nr_process</code> that counts the total number of processes started by a procuring entity.</p>
<p><span style="border-left: solid 5px lightgray;padding-left: 1em;display: block;margin-block-start: 1em;margin-block-end: 1em;margin-inline-start: 40px;margin-inline-end: 40px;font-size:80%"><strong>Tip:</strong> look at <code>egen</code>’s help file to find a function that counts the number of non-missing observations for a variable.</span></p>
</div>
<div id="create-new-variables-egen-3" class="slide section level1">
<h1>Create new variables: <code>egen</code></h1>
<p><strong>Exercise:</strong> create a new variable called <code>nr_process</code> that counts the total number of processes started by a procuring entity.</p>
<pre class='stata'>. egen nr_process = count(process_value), by(entity)
</pre>
</div>
<div id="create-new-variables-best-practices" class="slide section level1">
<h1>Create new variables: best practices</h1>
<ul>
<li>Create <span style="color:orange">new variables</span> instead of overwriting the original information
<ul>
<li>That is, prefer using <code>generate</code> over <code>replace</code></li>
</ul></li>
<li>Order the data set so that related variables are close to each other</li>
</ul>
</div>
<div id="aggregating-observations-collapse" class="slide section level1">
<h1>Aggregating observations: <code>collapse</code></h1>
<ul>
<li>We have seens how <code>egen</code> allows us to aggregate observations by group without changing the unit of observation of the data set</li>
<li>Sometimes, however, we do want to change the unit of observation</li>
<li>In these cases, we will use the command <code>collapse</code></li>
<li>It works as follows:</li>
</ul>
<pre class='stata'>. collapse (sum) total_value = process_value nr_participants (count) nr_procedures = procedure_id, by(entity)
</pre>
</div>
<div id="aggregating-observations-collapse-1" class="slide section level1">
<h1>Aggregating observations: <code>collapse</code></h1>
<p>Note that running this command will <strong>completely change the data set in memory</strong></p>
<pre class='stata'>. codebook

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
entity                                                                                                                 Entity Name
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

                  type:  string (str131)

         unique values:  1,892                    missing "":  0/1,892

              examples:  "GRAD MALI LOšINJ"
                         "KRALJEVAC D.O.O."
                         "OPĆINA VILJEVO"
                         "SPECIJALNA BOLNICA ZA PLUćNE BOLESTI"

               warning:  variable has embedded and trailing blanks

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
total_value                                                                                                    (sum) process_value
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

                  type:  numeric (double)

                 range:  [30000,9.280e+09]            units:  1.000e-13
         unique values:  1,661                    missing .:  0/1,892

                  mean:   5.9e+07
              std. dev:   3.9e+08

           percentiles:        10%       25%       50%       75%       90%
                            422852   1.2e+06   3.8e+06   1.5e+07   6.8e+07

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
nr_participants                                                                                              (sum) nr_participants
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

                  type:  numeric (double)

                 range:  [0,4465]                     units:  1
         unique values:  170                      missing .:  0/1,892

                  mean:   29.2082
              std. dev:     135.6

           percentiles:        10%       25%       50%       75%       90%
                                 1         2         7        20        50

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
nr_procedures                                                                                                 (count) procedure_id
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

                  type:  numeric (long)

                 range:  [1,1925]                     units:  1
         unique values:  108                      missing .:  0/1,892

                  mean:    13.213
              std. dev:    59.712

           percentiles:        10%       25%       50%       75%       90%
                                 1         1         3         8        24
</pre>
</div>
<div id="aggregating-observations-collapse-2" class="slide section level1">
<h1>Aggregating observations: <code>collapse</code></h1>
<p>There is a number of different statistics that can be used with <code>collapse</code>, including:</p>
<ul>
<li><code>mean</code></li>
<li><code>median</code></li>
<li><code>sd</code></li>
<li><code>sum</code></li>
<li><code>count</code></li>
<li><code>percent</code></li>
<li><code>min</code></li>
<li><code>max</code></li>
</ul>
</div>
<div id="creating-aggregate-measures-best-practices" class="slide section level1">
<h1>Creating aggregate measures: best practices</h1>
<ul>
<li>Check and double-check the value assignments of questions, as well as their scales, before constructing new variables based on them</li>
<li>Look at the distributions of both the original and the constructed variables to make sure there is nothing unexpected hapenning in your data</li>
<li>Be mindful of how missing values are treated</li>
</ul>
<p><span style="border-left: solid 5px lightgray;padding-left: 1em;display: block;margin-block-start: 1em;margin-block-end: 1em;margin-inline-start: 40px;margin-inline-end: 40px;font-size:80%"><strong>Be careful:</strong> the stat <code>sum</code> treats missing values as zero. If you are adding multiple missing values, it will return zero.</span></p>
</div>
<div id="combining-data-sets" class="slide section level1">
<h1>Combining data sets</h1>
<ul>
<li>All the construction operations we have seen so far are using a single data table</li>
<li>However, we often have separate data table that we need to combine to create our final indicators</li>
<li>There are two main ways of combining data tables:
<ul>
<li><strong>Appending</strong>, when we combine data sets that have more or less the same columns, but different instances of the unit of observation to <span style="color:orange">increase the number of observations</span></li>
<li><strong>Merging</strong>, when we combine data sets that comtain more or less the same instances of the unit of observation, but different variables, to <span style="color:orange">create a data set with more columns than the original ones</span></li>
</ul></li>
</ul>
<p>Can you think of cases when you would need to perform <strong>merges</strong> or <strong>appends</strong>?</p>
</div>
<div id="combining-data-sets-1" class="slide section level1">
<h1>Combining data sets</h1>
<ul>
<li>Examples of when <span style="color:orange"><code>append</code></span> is needed include:
<ul>
<li>Combining data from different years</li>
<li>Combining data from different countries</li>
</ul></li>
<li>Examples of when <span style="color:orange"><code>merge</code></span> is needed include:
<ul>
<li>Combining data on contracts and bids for a given procedure</li>
<li>Including product information on contract data set</li>
<li>Adding entity information on a procedure data set</li>
</ul></li>
</ul>
</div>
<div id="combining-data-sets-append" class="slide section level1">
<h1>Combining data sets: <code>append</code></h1>
<p><span style="font-size:10%"></span></p>
<pre><code>append using filename [filename ...] [, options]</code></pre>
<ul>
<li>Note that now we have <span style="color:orange">more than one data set</span>, which has not happened so far</li>
<li>Stata can only have one data set in memory at once, so we need to call data sets saved as .dta (or as temporary files, but that is a more advanced topic)</li>
<li><code>append</code> will add the observations in the <strong>using</strong> data set (the one referred to by its file name) to the <strong>master</strong> data set (the one that is currently loaded in the memory)</li>
</ul>
</div>
<div id="combining-data-sets-append-1" class="slide section level1">
<h1>Combining data sets: <code>append</code></h1>
<p><span style="font-size:10%"></span></p>
<pre><code>use          &quot;DataWork/Data/Intermediate/entities_2018.dta&quot;, clear
append using &quot;DataWork/Data/Intermediate/entities_2019.dta&quot;</code></pre>

</div>
<div id="combining-data-sets-append-2" class="slide section level1">
<h1>Combining data sets: <code>append</code></h1>
<ul>
<li>When a variable is present in only one of the datasets, its values will be <strong>missing</strong> for all observations that came from the data set that did not include it</li>
<li>If columns with the same name have different types of variables in the two data sets (for example, one is a string and one is numeric), this will cause the append to fail</li>
<li>You can use the option <code>gen(newvar)</code> to include a new variable indicating from which data set each observation came</li>
<li>Note that <span style="color:orange">it only makes sense to append data sets that have <strong>the same unit of observation</strong></span>, but different instances of that unit of observation</li>
</ul>
</div>
<div id="combining-data-sets-append-3" class="slide section level1">
<h1>Combining data sets: <code>append</code></h1>
<p><strong>Exercise:</strong> explore the different options of the command <code>append</code> using the different years found in the <code>DataWork/Data/Intermediate</code> folder.</p>
</div>
<div id="combining-data-sets-merge" class="slide section level1">
<h1>Combining data sets: <code>merge</code></h1>
<p>One-to-one merge on specified key variables</p>
<pre><code>merge 1:1 varlist using filename [, options]</code></pre>
<p>Many-to-one merge on specified key variables</p>
<pre><code>merge m:1 varlist using filename [, options]</code></pre>
</div>
<div id="combining-data-sets-merge-1" class="slide section level1">
<h1>Combining data sets: <code>merge</code></h1>
<ul>
<li><span style="color:orange">It only makes sense to merge two data sets that have <strong>common observations</strong></span></li>
<li>The syntax of the command makes it clear that when merging two data sets, they <strong>need to have at least one variable in common</strong></li>
<li>This is one of the points in the data work where having <span style="color:orange">uniquely identifying variables</span> is extremely important!</li>
<li>When merging two data sets, we first need to load one of them in memory (this will be called the <strong>master</strong> data), and then refer to the other one (the <strong>using</strong> data set) by its filename</li>
</ul>
</div>
<div id="combining-data-sets-merge-2" class="slide section level1">
<h1>Combining data sets: <code>merge</code></h1>
<p><span style="font-size:10%"></span></p>
<pre><code>use &quot;DataWork/Data/Clean/process_clean.dta&quot;, clear

merge 1:m process_id using &quot;DataWork/Data/Clean/contract_clean.dta&quot;</code></pre>
<pre class='stata'>    Result                           # of obs.
    ─────────────────────────────────────────
    not matched                             1
        from master                         0  (_merge==1)
        from using                          1  (_merge==2)

    matched                            41,817  (_merge==3)
    ─────────────────────────────────────────
</pre>
</div>
<div id="combining-data-sets-merge-3" class="slide section level1">
<h1>Combining data sets: <code>merge</code></h1>
<p><span style="font-size:10%"></span></p>
<pre><code>use &quot;DataWork/Data/Clean/process_clean.dta&quot;, clear

merge 1:m process_id using &quot;DataWork/Data/Clean/contract_clean.dta&quot;</code></pre>
<p><span style="border-left: solid 5px lightgray;padding-left: 1em;display: block;margin-block-start: 1em;margin-block-end: 1em;margin-inline-start: 40px;margin-inline-end: 40px;font-size:80%">Note that Stata lets you know how well the match between the two data sets went.</span></p>
</div>
<div id="combining-data-sets-merge-4" class="slide section level1">
<h1>Combining data sets: <code>merge</code></h1>
<p>There are different options that allow you to test that the result of the merge matches your expections</p>
<p><img src="img/merge-opts.png" /></p>
</div>
<div id="where-things-go-wrong" class="slide section level1">
<h1>Where things go wrong</h1>
<ul>
<li>The more complex construction tasks involve changing the structure of the data, such as the sample and the unit of observation</li>
<li>Merges and collapses may change the number of observation and create missing entries</li>
<li>Make sure to read about how each command treats missing observations</li>
</ul>
</div>
<div id="write-pseudocode" class="slide section level1">
<h1>Write pseudocode</h1>
<ul>
<li>Describe the steps to create your indicator in plain English</li>
<li>Refine the sub-steps involved</li>
<li>When you are getting into too much detail, write code</li>
<li>Think about possible errors that may come up at each sub-step</li>
</ul>
</div>
<div id="think-about-expected-results" class="slide section level1">
<h1>Think about expected results</h1>
<ul>
<li>Think about how the command you are using treats missing values</li>
<li>Try to predict the result you will get
<ul>
<li>Will all observations merge?</li>
<li>Will the number of observations change?</li>
<li>Will missing values be created?</li>
</ul></li>
</ul>
</div>
<div id="document-the-observed-results" class="slide section level1">
<h1>Document the observed results</h1>
<ul>
<li>Explore the actual results from the operation</li>
<li>Write down in comments what happened</li>
<li>Add comments to the code explaining unexpected consequences</li>
</ul>
</div>
<div id="combining-data-sets-merge-5" class="slide section level1">
<h1>Combining data sets: <code>merge</code></h1>
<p><strong>Exercise:</strong> write pseudocode and use <code>merge</code>’s options to refine the code to merge process and contract data.</p>
<p><span style="border-left: solid 5px lightgray;padding-left: 1em;display: block;margin-block-start: 1em;margin-block-end: 1em;margin-inline-start: 40px;margin-inline-end: 40px;font-size:80%">You will later be able to re-use this code whenever you are combining data sets to make sure you are not creating problems in your data, so invest time in creating a nice piece of code.</span></p>
</div>
<div id="constructed-data-sets" class="slide section level1">
<h1>Constructed data sets</h1>
<ul>
<li>A constructed data set is <span style="color:orange">purpose-built</span> to answer an analysis question</li>
<li>Different pieces of analysis may require <span style="color:orange">different units of analysis</span>, in which case you may have as many constructed data sets as required</li>
<li>Don’t worry if you cannot create a single, “canonical” analysis data set</li>
<li>If you have multiple constructed data sets, <span style="color:orange">name them carefully</span> so you know when to use each of them</li>
</ul>
</div>
<div id="documenting-construction" class="slide section level1">
<h1>Documenting construction</h1>
<ul>
<li>Documentation is an output of construction as relevant as the code and the data</li>
<li>Someone unfamiliar with the project should be able to understand the contents of the analysis data set, and the steps taken to create them</li>
<li>Document exactly <strong>how</strong> each variable is derived or calculated</li>
<li>Carefully record how specific variables have been <strong>combined</strong>, <strong>recoded</strong>, and <strong>scaled</strong>, and refer to those records in the code</li>
<li>This can be part of a wider discussion with your team about creating <span style="color:orange">protocols for variable definition</span>, which will guarantee that indicators are defined consistently across projects</li>
</ul>
</div>
<div id="construction-documentation-should-include" class="slide section level1">
<h1>Construction documentation should include</h1>
<ul>
<li>A <span style="color:orange">data dictionary</span> with the definition of all variables in the each analysis data table</li>
<li>A codebook with <span style="color:orange">summary statistics</span> and other description of all variables in each analysis data table</li>
<li>References to the <span style="color:orange">sources</span> of variable definitions</li>
<li><span style="color:orange">Comments</span> on the code explaining each step of the construction and referring to the data dictionary and sources of variable definitions</li>
</ul>
</div>
<div id="why-is-construction-a-separate-task-from-data-analysis" class="slide section level1">
<h1>Why is construction a separate task from data analysis?</h1>
<ul>
<li>In practice, data construction often times happens at the same time as data analysis</li>
<li>As you analyze the data, different constructed variables will become necessary, as well as subsets and other alterations to the data</li>
<li>However, even if construction ends up coming before analysis only in the order the code is run, it’s important to think of them as separate steps</li>
</ul>
</div>
<div id="why-is-construction-a-separate-task-from-data-analysis-1" class="slide section level1">
<h1>Why is construction a separate task from data analysis?</h1>
<ul>
<li><strong>Maintainability:</strong> If every analysis script cleans and constructs variables from the raw data, any edits to this process need to be replicated in all scripts that use the same final variable.</li>
<li><strong>Preventing errors:</strong> The difficulty to maintain such scripts increases the chances that at least one of them will have a different sample or variable definition.</li>
</ul>
<p>Doing all variable construction and data transformation in a unified script, separate from the analysis code, helps avoid this and ensures consistency across different outputs.</p>
</div>
<div id="appendix---advanced-mathematical-operations" class="slide section level1">
<h1>Appendix - Advanced mathematical operations</h1>
<ul>
<li><strong>Logarithm:</strong> <code>ln(numvar)</code></li>
<li><strong>Square root:</strong> <code>sqrt(numvar)</code></li>
<li><strong>Exponentiation:</strong> <code>numvar^num</code></li>
</ul>
</div>
<div id="appendix---aggregating-variables" class="slide section level1">
<h1>Appendix - Aggregating variables</h1>
<p>What is the difference between the results for the following lines of code in the presence of missing values?</p>
<ul>
<li><code>gen income_total = income_wage + income_rent + income_sales</code></li>
<li><code>egen income_total = rowtotal(income_wage income_rent income_sales)</code></li>
<li><code>egen income_total = rowtotal(income_wage income_rent income_sales), m</code></li>
</ul>
</div>
</div>
</body>
</html>
